Chapter 10
Statistics and Causation
Kullback (1959) points out that information theory is a branch of the mathematical
theory of probability and statistics; insofar as bioinformatics is a branch of informa-
tion theory, it follows that statistics is superordinate to bioinformatics. As such, it is
clearly beyond the scope of this book to expound statistics, for which many excellent
texts exist. 1 Nevertheless, a few words might be useful, if only to set bioinformatics
within its statistical context.
10.1
A Brief Outline of Statistics
Science is rarely concerned with a single number, and Galileo showed how to
make sense of numerical data, observational or experimental—a collection of num-
bers pertaining to a phenomenon, meaningless without some kind of interpreta-
tion (i.e., a model, and ultimately mathematical equations linking those numbers).
Bernoulli (1777) resolved the vexing question of how to deal with apparent outliers.
Descartes gave us graphical, coördinate-based representation of data, and much prac-
tical statistics is indeed concerned with how best to present numerical data visually
(cf. Sect. 13.4).
One often wishes to compare two or more sets of data and determine whether
there is a significant difference between them. Chapter 9 has already given us various
quantities that might be extracted from a dataset; a simple and widely used test for
significance of the difference of means is to determine the ratio of the variance
between groups to the variance within groups (ANOVA or analysis of variance); the
difference is significant if the ratio is much greater than 1. Support for propositions (hypotheses) is
discussed in Chap. 9. Often one of the datasets is that which would be generated
by chance; Polya (1954) gives an excellent exposition in his chapter “Chance, the
ever-present rival conjecture”.
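
The variance-ratio test described above can be sketched as follows. This is a minimal illustration rather than a definitive implementation: it assumes a one-way layout and estimates each variance as a mean square (sum of squares divided by its degrees of freedom). The function name `anova_f_ratio` and the sample data are hypothetical and are not taken from the text.

```python
from statistics import mean


def anova_f_ratio(groups):
    """Return the ratio of between-group to within-group variance
    (the one-way ANOVA F ratio) for a list of groups of numbers.

    Assumes at least two groups and more observations than groups.
    """
    k = len(groups)                        # number of groups
    n_total = sum(len(g) for g in groups)  # total number of observations
    grand_mean = mean([x for g in groups for x in g])
    group_means = [mean(g) for g in groups]

    # Between-group sum of squares: spread of the group means
    # about the grand mean, weighted by group size.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Within-group sum of squares: spread of each observation
    # about its own group mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )

    ms_between = ss_between / (k - 1)       # between-group variance estimate
    ms_within = ss_within / (n_total - k)   # within-group variance estimate
    return ms_between / ms_within


if __name__ == "__main__":
    # Illustrative (made-up) data: the third group clearly differs.
    sample = [[1, 2, 3], [2, 3, 4], [7, 8, 9]]
    print(anova_f_ratio(sample))
```

For the made-up data shown, the ratio comes out at about 31, i.e., much greater than 1. In practice the ratio is compared with a critical value of the F distribution for the relevant degrees of freedom, rather than judged by eye.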
1 Freedman (2009) is especially recommended.
© Springer Nature Switzerland AG 2023
J. Ramsden, Bioinformatics, Computational Biology,
https://doi.org/10.1007/978-3-030-45607-8_10